[…] test was used for the generalisation test. Figure 7.17 (b) shows
the ROC curves as well as the AUC values of the three models. The AUC
values were almost one, indicating near-perfect discrimination between
SARS-[…] and SARS-CoV-2 genomes based on this alignment-free multiple
[…] comparison.
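As a reminder of what an AUC close to one expresses, the sketch below computes the AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. It is an illustration only: the function name `auc`, the 0/1 label coding and the toy scores are assumptions, not the models or data behind Figure 7.17.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as one half. Labels are 1 (positive) or 0 (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of one therefore means that every positive example is scored above every negative example, which is the sense in which discrimination is called perfect.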
Whole genome pattern discovery for SARS-CoV-2
Whole genome pattern discovery for SARS-CoV-2 is shown in this
section. Both sequences and metadata were downloaded from the Global
Initiative on Sharing All Influenza Data [Shu and McCauley, 2017] on the
[…]ry 2021. Up to that date, there were 315,253 sequences in total
from over 200 countries and regions.
Four countries with the highest infection numbers were selected for
this chapter: USA, India, Russia and Brazil. There were 57,836, 4,325,
1,572 and 1,811 sequences for USA, India, Russia and Brazil,
respectively. After removing duplicated sequences, 51,383, 4,321, 1,502
and 1,771 sequences were left for USA, India, Russia and Brazil,
respectively. There were finally 58,897 sequences from these four
countries. Each sequence was coded using the 3-mer approach and was
therefore represented by a numeric vector over the 64 possible 3-mers,
or words, i.e., $\mathbf{x} \in \mathbb{R}^{64}$.
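The two preprocessing steps described above, removing duplicated sequences and coding each sequence as a vector over the 64 possible 3-mers, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the chapter: the names `deduplicate`, `encode_3mer`, `KMERS` and `KMER_INDEX` are hypothetical, duplicates are taken to mean exact string matches, windows are assumed to overlap, and windows containing ambiguous bases such as N are assumed to be skipped. Whether the vector holds raw counts or relative frequencies is likewise left as an option.

```python
from itertools import product

ALPHABET = "ACGT"
# All 4^3 = 64 possible 3-mers in a fixed lexicographic order.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=3)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def deduplicate(sequences):
    """Drop exact duplicate sequences (case-insensitive), keeping the first."""
    seen = set()
    unique = []
    for seq in sequences:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique


def encode_3mer(sequence, normalise=True):
    """Return a length-64 vector of overlapping 3-mer counts or frequencies.

    Windows containing characters other than A, C, G, T are skipped;
    this is an assumption of the sketch.
    """
    seq = sequence.upper()
    counts = [0.0] * len(KMERS)
    for i in range(len(seq) - 2):
        idx = KMER_INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    if normalise and total > 0:
        counts = [c / total for c in counts]
    return counts


if __name__ == "__main__":
    toy = deduplicate(["ACGTACGT", "acgtacgt", "AAAACCCC"])
    X = [encode_3mer(s) for s in toy]
    print(len(toy), len(X[0]))
```

Fixing the order of `KMERS` once matters more than the particular order chosen, because every sequence must map the same 3-mer to the same coordinate for the vectors to be comparable.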
More than one type of model was constructed for this whole genome
pattern discovery problem. First, unsupervised machine learning models
were used to visualise how the genomic patterns (the 3-mer word
[…]y library) of the viral sequences were distributed in these four
countries. This analysis examines whether the set of 3-mer vectors of
all sequences ($\mathbf{X}$) can be efficiently and accurately divided
into four subsets, namely $\Omega_{\mathrm{USA}}$,
$\Omega_{\mathrm{India}}$, $\Omega_{\mathrm{Brazil}}$ and
$\Omega_{\mathrm{Russia}}$, using an unsupervised machine learning
model, $f(\mathbf{X})$.
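$$f(\mathbf{X}) \Longrightarrow \bigcup \{\Omega_{\mathrm{USA}}, \Omega_{\mathrm{India}}, \Omega_{\mathrm{Brazil}}, \Omega_{\mathrm{Russia}}\}$$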
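To make the partition idea concrete, the sketch below divides a set of vectors into four groups and then checks how well the groups line up with the country of origin. The text does not say which unsupervised model plays the role of f, so plain k-means (Lloyd's algorithm) is used here purely as a stand-in. The function names `kmeans` and `purity`, and the purity score itself, are assumptions made for illustration.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(X, k=4, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of X."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct rows chosen at random.
    centroids = [list(x) for x in rng.sample(X, k)]
    labels = [-1] * len(X)
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in X
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def purity(labels, countries):
    """Fraction of rows that belong to the majority country of their cluster."""
    by_cluster = {}
    for lab, country in zip(labels, countries):
        by_cluster.setdefault(lab, Counter())[country] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(labels)
```

With `X` holding one 64-dimensional 3-mer vector per sequence and `countries` the matching metadata labels, `purity(kmeans(X, k=4), countries)` close to one would suggest that the four subsets can be recovered without supervision, while a value near the share of the largest country would suggest that they cannot.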